Prediction Models: Tutorial

BDSI 2025; University of Michigan

Phil Boonstra

Example: Breast Cancer Diagnosis data

Digitized images of fine needle aspirates (FNA) of breast masses from 569 patients

Figure 2, Street (1993)

Outcome: Clinical diagnosis (malignant or benign)

Predictors

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

For each of the 10 predictors, the data include the mean, “standard error”, and worst measurements.

About 35% of diagnoses were malignant.

Task: Build a prediction model to predict probability of being malignant given cell characteristics.

  1. Use the training data (‘breast_dx_train.csv’) and the validation data (‘breast_dx_validation.csv’) to build and select the model.

  2. You can use logistic regression with any of the model-building approaches we considered, or something else. Alternatively, you can build a classifier using a machine learning approach. Note that if you develop a model that only classifies observations (predicting 0 or 1 rather than a probability), your MSPE, absolute loss, and 0-1 loss will all be the same.

  3. Use the training and validation data any way you want, but do not use the test data until you’ve selected one final model. No cheating and no going back to fiddle with the model after you’ve seen the test data!

  4. Evaluate your one model on the test data (‘breast_dx_test.csv’) and report your performance metrics here:

https://forms.gle/6kmfzTPok25hi4v26
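The train / validation / test workflow in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the intended solution: the data are simulated stand-ins (the predictors `x1`, `x2`, `x3` and the three candidate models are hypothetical), only the outcome name `malignant` and the 300 / 135 / 134 split sizes are taken from the tutorial, and MSPE is used as the selection criterion purely as an example.

```r
# Simulated stand-in data; NOT the breast cancer data (assumption for illustration)
set.seed(2025)
n <- 569
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# The simulated outcome depends on x1 and x2 only; x3 is pure noise
sim$malignant <- rbinom(n, size = 1,
                        prob = plogis(-0.7 + 1.5 * sim$x1 - 1.0 * sim$x2))

# Random split into 300 / 135 / 134 observations, mirroring the tutorial's sizes
idx <- sample(rep(c("train", "validation", "test"), times = c(300, 135, 134)))
train <- sim[idx == "train", ]
validation <- sim[idx == "validation", ]
test <- sim[idx == "test", ]

# Candidate models are fit on the training data only
candidates <- list(
  small  = glm(malignant ~ x1, family = binomial, data = train),
  medium = glm(malignant ~ x1 + x2, family = binomial, data = train),
  full   = glm(malignant ~ x1 + x2 + x3, family = binomial, data = train)
)

# Mean squared prediction error of a fitted model on a given data set
mspe <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean((data$malignant - p)^2)
}

# Select one model using the validation data
validation_mspe <- sapply(candidates, mspe, data = validation)
best <- names(which.min(validation_mspe))

# Only now touch the test data, once, with the single selected model
test_mspe <- mspe(candidates[[best]], test)
```

The point of the design is that `test` appears exactly once, after `best` has been fixed; any model comparison happens on `validation`.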

Getting started

Original data are available at https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

library(MASS)        # stepAIC; loaded before tidyverse so MASS::select does not mask dplyr::select
library(tidyverse)   # read_csv
library(pROC)        # roc
library(logistf)     # logistf
library(glmnet)      # glmnet
library(glmnetUtils) # formula interface for glmnet

# 300 randomly selected observations for training
breast_dx_train <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_train.csv")
# 135 randomly selected observations for validation
breast_dx_validation <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_validation.csv")
# 134 remaining observations for testing
breast_dx_test <- read_csv("https://raw.githubusercontent.com/psboonstra/prediction-lecture/refs/heads/main/breast_dx_test.csv")

# Build your model...

# Get predictions from your model (my_model is a placeholder for your fitted model object):
test_predictions <-
  predict(my_model, newdata = breast_dx_test, type = "response")
# Get MSPE, absolute loss, 0-1 loss, and deviance using code from lecture

# Get AUC
roc(response = breast_dx_test$malignant, 
    predictor = test_predictions)
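The lecture code for the loss functions is not reproduced here, so the following is only a sketch of one common way to define them. The function names, the 0.5 classification cutoff, the use of the mean (rather than the sum) for the deviance, and the toy vectors `y` and `p` are all assumptions made for illustration. The last lines check the remark from the task description that MSPE, absolute loss, and 0-1 loss coincide when predictions are hard 0/1 classifications.

```r
# y: observed 0/1 outcomes; p: predicted probabilities (or hard 0/1 classifications)
mspe_loss     <- function(y, p) mean((y - p)^2)
absolute_loss <- function(y, p) mean(abs(y - p))
# Classify as 1 when the prediction exceeds the cutoff (0.5 is an assumed default)
zero_one_loss <- function(y, p, cutoff = 0.5) mean(y != as.numeric(p > cutoff))
# Mean binomial deviance; predictions are clipped away from 0 and 1 to avoid log(0)
deviance_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -2 * mean(y * log(p) + (1 - y) * log(1 - p))
}

# Toy values for illustration only (not from the breast cancer data)
y <- c(1, 0, 0, 1, 0)
p <- c(0.9, 0.2, 0.6, 0.4, 0.1)

# With predicted probabilities the three losses generally differ
losses_prob <- c(mspe = mspe_loss(y, p),
                 absolute = absolute_loss(y, p),
                 zero_one = zero_one_loss(y, p))

# With hard classifications, |y - p| and (y - p)^2 are both 0 or 1,
# so all three losses reduce to the misclassification rate
p_hard <- as.numeric(p > 0.5)
losses_hard <- c(mspe = mspe_loss(y, p_hard),
                 absolute = absolute_loss(y, p_hard),
                 zero_one = zero_one_loss(y, p_hard))
```

With test data in hand, the same functions could be called as, for example, `mspe_loss(breast_dx_test$malignant, test_predictions)`, assuming `malignant` is coded 0/1. Note that the deviance is not informative for hard classifications, since any misclassified observation contributes a very large value.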

Results

References

Mangasarian, O.L., Street, W.N. and Wolberg, W.H., 1995. Breast cancer diagnosis and prognosis via linear programming. Operations research, 43(4), pp.570-577.

Street, W.N., Wolberg, W.H. and Mangasarian, O.L., 1993, July. Nuclear feature extraction for breast tumor diagnosis. In Biomedical image processing and biomedical visualization (Vol. 1905, pp. 861-870). SPIE.